Simple Linear Regression is a statistical technique used to model the relationship between two continuous variables: a dependent variable (also known as the response or outcome variable) and an independent variable (also known as the predictor or explanatory variable).
It aims to find the best-fitting straight line through the data points to predict the values of the dependent variable based on the values of the independent variable.
The equation for a simple linear regression model is typically represented as:
y = β0 + β1 * x + ε
Where:
- y is the dependent variable (the variable we want to predict).
- x is the independent variable (the variable used to make predictions).
- β0 is the intercept, representing the value of y when x is zero.
- β1 is the slope of the regression line, indicating how much y changes for each unit change in x.
- ε represents the error term, which accounts for the variability of y that is not explained by the regression line.
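To make these pieces concrete, here is a minimal sketch (with made-up coefficients) that simulates data from this model; the intercept, slope, and error scale below are illustrative values, not estimates from any real dataset:

import numpy as np

rng = np.random.default_rng(42)
beta0 = 2.0                            # intercept: value of y when x is zero
beta1 = 0.5                            # slope: change in y per unit change in x
x = rng.uniform(0, 10, size=50)        # independent variable
epsilon = rng.normal(0, 1, size=50)    # error term: variability not explained by the line
y = beta0 + beta1 * x + epsilon        # dependent variable generated by the model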
The goal of simple linear regression is to estimate the values of β0 and β1 that minimize the sum of squared differences between the predicted values (β0 + β1 * x) and the actual observed values of the dependent variable. This is usually done using a method called the least squares approach.
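As a small illustration of that criterion, the helper below (a sketch; the function name is mine) computes the sum of squared differences for any candidate pair of coefficients, given NumPy arrays x and y; the least squares estimates are the pair that makes this quantity as small as possible:

import numpy as np

def sum_of_squared_errors(b0, b1, x, y):
    y_pred = b0 + b1 * x               # predicted values from the candidate line
    return np.sum((y - y_pred) ** 2)   # squared vertical distances, summed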
Once the regression coefficients (β0 and β1) are estimated, the fitted regression line can be used to make predictions for new values of the independent variable.
Simple linear regression is a fundamental and widely used technique in statistics and machine learning for understanding and modeling the relationship between two variables, especially when the relationship appears to be linear. However, when dealing with more complex relationships, multiple linear regression or other advanced regression techniques may be more appropriate.
The ordinary least squares (OLS) method provides a way to estimate these coefficients. Suppose we have a dependent variable (Y) and an independent variable (X), and we want to find the best-fitting line that represents the linear relationship between them. The equation of the line is given by:
Y = β0 + β1 * X
Where:
- Y is the dependent variable (the one we want to predict or explain).
- X is the independent variable (the predictor or explanatory variable).
- β0 is the intercept (the value of Y when X is 0).
- β1 is the slope (the change in Y for a one-unit change in X).
The goal of the OLS method is to find the values of β0 and β1 that minimize the sum of squared differences between the observed values of Y (Yi) and the predicted values (Ŷi) from the linear equation for all data points (i) in the dataset.
Mathematically, the OLS estimates of β0 and β1 are obtained as follows:
β1 = Σ((Xi - X̄)(Yi - Ȳ)) / Σ((Xi - X̄)²)
β0 = Ȳ - β1 * X̄
Where:
- Σ denotes the sum over all data points.
- Xi is the value of the independent variable for the ith data point.
- Yi is the value of the dependent variable for the ith data point.
- X̄ is the mean of all X values.
- Ȳ is the mean of all Y values.
The OLS method is called “least squares” because it minimizes the sum of the squared vertical distances between the observed data points and the regression line. The line obtained through OLS is the “best-fitting” line because it minimizes the total squared error between the observed values and the predicted values.
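These formulas translate almost line for line into NumPy. The sketch below (the function name is my own) computes the OLS estimates from arrays of X and Y values:

import numpy as np

def ols_estimates(X, Y):
    X_bar, Y_bar = X.mean(), Y.mean()
    # slope: sum of cross-deviations of X and Y over the sum of squared deviations of X
    beta1 = np.sum((X - X_bar) * (Y - Y_bar)) / np.sum((X - X_bar) ** 2)
    # intercept: chosen so the line passes through the point (X̄, Ȳ)
    beta0 = Y_bar - beta1 * X_bar
    return beta0, beta1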
Once you have estimated the values of β0 and β1 using OLS, you can use the linear equation (Y = β0 + β1 * X) to predict the value of the dependent variable Y for any given value of the independent variable X. Additionally, you can assess the goodness of fit of the regression model and make inferences about the relationship between the two variables using statistical tests and measures such as R-squared, t-tests, etc.
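For instance, a minimal sketch of prediction and an R-squared check (using the standard 1 − SSR/SST definition; the function names are mine) might look like this:

import numpy as np

def predict(beta0, beta1, X):
    return beta0 + beta1 * X                   # fitted line Y = β0 + β1 * X

def r_squared(Y, Y_hat):
    ss_res = np.sum((Y - Y_hat) ** 2)          # variation left unexplained by the line
    ss_tot = np.sum((Y - Y.mean()) ** 2)       # total variation of Y around its mean
    return 1 - ss_res / ss_tot                 # share of variation explained by the model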
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Load the data
data = pd.read_csv("1.01. Simple linear regression.csv")
# Define the dependent and the independent variables
y = data['GPA']
x1 = data['SAT']
# Explore the data
plt.scatter(x1, y)
plt.xlabel('SAT', fontsize=20)
plt.ylabel('GPA', fontsize=20)
plt.show()
# Run the regression: add a constant column so the model includes an intercept term
x = sm.add_constant(x1)
results = sm.OLS(y, x).fit()
print(results.summary())
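# The fitted coefficients can also be read programmatically instead of being
# copied from the summary table; results.params holds the intercept ('const')
# and the slope ('SAT'), roughly 0.275 and 0.0017 for this dataset
print(results.params)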
# Plot the data with the fitted line ŷ = b₀ + b₁x₁, using the intercept and
# slope reported in the regression summary
plt.scatter(x1, y)
yhat = 0.0017 * x1 + 0.275
plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
plt.legend()
plt.xlabel('SAT', fontsize=20)
plt.ylabel('GPA', fontsize=20)
plt.show()
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate some example data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")
# Plot the training data and the regression line
plt.scatter(X_train, y_train, label='Training Data')
plt.scatter(X_test, y_test, label='Test Data')
plt.plot(X_test, y_pred, color='red', linewidth=3, label='Regression Line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()